2.5 2D distributions
2.5.1 Rectangular binning in plotly.js
The plotly package provides two functions for displaying rectangular bins: add_heatmap() and add_histogram2d(). For numeric data, the add_heatmap() function is a 2D analog of add_bars() (bins must be pre-computed), and the add_histogram2d() function is a 2D analog of add_histogram() (bins can be computed in the browser). Thus, I recommend add_histogram2d() for exploratory purposes, since you don’t have to think about how to perform binning. It also provides a useful zsmooth attribute for effectively increasing the number of bins (currently, “best” performs a bi-linear interpolation, a type of nearest neighbors algorithm), and nbinsx/nbinsy attributes to set the number of bins in the x and/or y directions. Figure ?? compares three different uses of add_histogram(): (1) plotly.js’ default binning algorithm, (2) the default plus smoothing, (3) setting the number of bins in the x and y directions.
p <- plot_ly(diamonds, x = ~log(carat), y = ~log(price))
subplot(
add_histogram2d(p) %>%
colorbar(title = "default", len = 1/3, y = 1) %>%
layout(xaxis = list(title = "default")),
add_histogram2d(p, zsmooth = "best") %>%
colorbar(title = "zsmooth", len = 1/3, y = 2/3 - 0.05) %>%
layout(xaxis = list(title = "zsmooth")),
add_histogram2d(p, nbinsx = 60, nbinsy = 60) %>%
colorbar(title = "nbins", len = 1/3, y = 1/3 - 0.1) %>%
layout(xaxis = list(title = "nbins")),
shareY = TRUE, titleX = TRUE
)2.5.2 Rectangular binning in R
In Bars & histograms, we leveraged a number of algorithms for computing the “optimal” number of bins in 1D by routing the results of hist() into add_bars(). There is a surprising lack of academic research and computational tools for the 2D analog. Among the research that does exist, solutions usually depend on characteristics of the unknown underlying distribution, so a sensible approach is to assume a Gaussian form (David W. Scott 1992). Practically speaking, that assumption is not very useful, but thankfully kernel density estimation, and specifically the kde2d() function from the MASS package provides a well-supported rule-of-thumb.
kde_count <- function(x, y, ...) {
kde <- MASS::kde2d(x, y, ...)
df <- with(kde, setNames(expand.grid(x, y), c("x", "y")))
df$count <- with(kde, c(z) * length(x) * diff(x)[1] * diff(y)[1])
data.frame(df)
}
kd <- with(diamonds, kde_count(log(carat), log(price), n = 30))
plot_ly(kd, x = ~x, y = ~y, z = ~count) %>%
add_heatmap() %>%
colorbar(title = "Number of diamonds")2.5.3 Basic heatmaps
The add_heatmap() function can also handle categorical x and y variables.
corr <- cor(diamonds[vapply(diamonds, is.numeric, logical(1))])
plot_ly(x = rownames(corr), y = colnames(corr), z = corr) %>%
add_heatmap() %>%
colorbar(limits = c(-1, 1))A heatmap with categorical axes is basically a two-way table. The Titanic dataset is a four-way table that we can put into a tidy data frame using the tidy() function from the broom package.
broom::tidy(Titanic) %>%
plot_ly(x = ~interaction(Age, Survived), y = ~interaction(Class, Sex)) %>%
add_heatmap(z = ~Freq)
#one_heatmap <- function(d) {
# plot_ly(d, x = ~Class, y = ~Sex) %>%
# add_heatmap(z = ~Freq) %>%
# add_annotations(
# ~unique(paste(Age, Survived, sep = ", ")), x = 0.5, y = 1,
# yanchor = "bottom", xanchor = "middle",
# xref = "paper", yref = "paper", showarrow = FALSE
# )
#}
#
#tidy(Titanic) %>%
# group_by(Age, Survived) %>%
# do(p = one_heatmap(.)) %>%
# subplot(
# nrows = 2, margin = c(0.01, 0.01, 0.04, 0.04),
# shareX = TRUE, shareY = TRUE,
# titleY = FALSE, titleX = FALSE
# )2.5.4 Contours
plot_ly(diamonds, x = ~cut, y = ~clarity) %>%
add_histogram2dcontour()